17 research outputs found

    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Full text link
    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Nowadays, Deep Learning offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has recently been introduced in the image processing field with the intent to overcome some known limitations of CNNs, specifically the scarce robustness to affine transformations (i.e., perspective, size, orientation) and the detection of overlapped images. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" procedure that encourages learning part-whole relationships and improves the detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
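    As a rough illustration of the routing mechanism this abstract refers to, the sketch below implements dynamic routing by agreement (Sabour et al., 2017) in plain NumPy; the shapes and the number of routing iterations are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of dynamic routing by agreement, the mechanism CapsNets
# use to link "part" capsules to "whole" (event) capsules. Shapes and the
# number of routing iterations are illustrative assumptions.
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Non-linearity that keeps vector orientation, bounds length in [0, 1)."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: predictions from lower capsules, shape (n_in, n_out, dim_out)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum
        v = squash(s)                                         # output capsules
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)          # agreement update
    return v

# Toy example: 8 "part" capsules routing to 4 "event" capsules of dimension 16.
v = dynamic_routing(np.random.randn(8, 4, 16))
print(v.shape)  # (4, 16); vector length ~ event-presence probability
```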

    Deep Recurrent Neural Network-Based Autoencoders for Acoustic Novelty Detection

    Get PDF
    In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or state-space models. Only recent studies have introduced (pseudo-)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short-term frame are predicted from the previous frames by means of Long Short-Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as an activation signal to detect novel events. No prior study has compared these efforts to automatically recognize novel events from audio signals or given a broad, in-depth evaluation of recurrent neural network-based autoencoders. The present contribution aims to consistently evaluate our recent novel approaches to fill this gap in the literature and to provide insight through extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and state-of-the-art methods, the article shows how RNN-based autoencoders outperform statistical approaches by up to an absolute improvement of 16.4% average F-measure over the three databases.
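    A minimal sketch of the idea described above is given below: an LSTM network predicts the next spectral frame from the previous ones, and a frame is flagged as novel when its prediction error exceeds a threshold fitted on "normal" data. The layer sizes, the 40-band Log-Mel front end and the 99th-percentile threshold are assumptions, not the published configuration.

```python
# A minimal sketch (PyTorch) of a recurrent-autoencoder novelty detector:
# predict frame t+1 from frames <= t, use the error as the activation signal.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.enc = nn.LSTM(n_mels, hidden, batch_first=True)
        self.dec = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, x):                  # x: (batch, frames, n_mels)
        z, _ = self.enc(x)
        h, _ = self.dec(z)
        return self.out(h)

def novelty_scores(model, x):
    """Per-frame error between predicted and actual next frames."""
    with torch.no_grad():
        pred = model(x[:, :-1])            # output at t trained to match x[t+1]
        return ((pred - x[:, 1:]) ** 2).mean(dim=-1)   # (batch, frames-1)

model = LSTMAutoencoder()
normal = torch.randn(16, 100, 40)          # stand-in for Log-Mel features
threshold = novelty_scores(model, normal).quantile(0.99)  # assumed percentile
is_novel = novelty_scores(model, torch.randn(1, 100, 40)) > threshold
```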

    Deep Learning for Sound Event Detection and Classification

    No full text
    The recent progress in acoustic signal processing and machine learning techniques has enabled the development of innovative technologies for the automatic analysis of sound events. In particular, one of the most popular current approaches to this problem relies on the exploitation of Deep Learning techniques. As further proof, on several occasions neural architectures originally designed for other multimedia domains have been successfully proposed to process the audio signal. Indeed, although these tasks have long been addressed with statistical modelling algorithms such as Gaussian Mixture Models, Hidden Markov Models or Support Vector Machines, the new breakthrough of machine learning for audio processing has led to encouraging results. Hence, this thesis reports an up-to-date state of the art and proposes several reliable DNN-based methods for Sound Event Detection (SED) and Sound Event Classification (SEC), with an overview of the Deep Neural Network (DNN) architectures used for this purpose and of the evaluation procedures and metrics used in this research field. In line with the recent trend, which shows an extensive employment of Convolutional Neural Networks (CNNs) for both SED and SEC tasks, this work also reports rather new approaches based on the Siamese DNN architecture or the novel Capsule computational units.
    Most of the reported systems were designed on the occasion of international challenges. This allowed access to public datasets and made it possible to compare systems proposed by the most competitive research teams on a common basis. The case studies reported in this dissertation refer to applications in a variety of scenarios, ranging from unobtrusive health monitoring and audio-based surveillance to bio-acoustic monitoring and classification of road surface conditions. These tasks face numerous challenges, particularly related to their application in real-life environments; among these issues are dataset unbalancing, different acquisition setups, acoustic disturbance (i.e., background noise, reverberation and cross-talk) and polyphony. In particular, since multiple events are very likely to overlap in real-life audio, two algorithms for polyphonic SED are reported in this thesis. A polyphonic SED algorithm can be considered as a system able to simultaneously perform detection (determining the onset and offset times of the sound events) and classification (assigning a label to each of the events occurring in the audio stream).
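    To make the detection-plus-classification definition concrete, here is a sketch of a typical polyphonic-SED output stage: frame-wise class probabilities are binarized and contiguous active runs become (onset, offset, label) events. The threshold, hop size and class names are illustrative assumptions, not the thesis's actual configuration.

```python
# A minimal sketch of polyphonic-SED event decoding: multi-label frame
# activity -> list of (onset_s, offset_s, label) events that may overlap.
import numpy as np

def decode_events(probs, class_names, threshold=0.5, hop_s=0.02):
    """probs: (n_frames, n_classes) network outputs; returns event list."""
    events = []
    active = probs >= threshold               # multi-label: events may overlap
    for c, name in enumerate(class_names):
        col = active[:, c].astype(int)
        edges = np.diff(np.concatenate(([0], col, [0])))
        onsets, offsets = np.where(edges == 1)[0], np.where(edges == -1)[0]
        events += [(on * hop_s, off * hop_s, name)
                   for on, off in zip(onsets, offsets)]
    return sorted(events)

probs = np.random.rand(500, 3)                # stand-in for DNN outputs
print(decode_events(probs, ["speech", "dog_bark", "glass_break"])[:5])
```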

    Acoustic novelty detection with adversarial autoencoders

    No full text
    Novelty detection is the task of recognising events that differ from a model of normality. This paper proposes an acoustic novelty detector based on neural networks trained with an adversarial training strategy. The proposed approach is composed of a feature extraction stage that calculates Log-Mel spectral features from the input signal. Then, an autoencoder network, trained on a corpus of 'normal' acoustic signals, is employed to detect whether a segment contains an abnormal event or not. A novelty is detected if the Euclidean distance between the input and the output of the autoencoder exceeds a certain threshold. The innovative contribution of the proposed approach resides in the training procedure of the autoencoder network: instead of the conventional training procedure that minimises only the Mean Squared Error loss function, here we adopt an adversarial strategy, in which a discriminator network is trained to distinguish between the output of the autoencoder and data sampled from the training corpus. The autoencoder is then trained also by using the binary cross-entropy loss calculated at the output of the discriminator network. The performance of the algorithm has been assessed on a corpus derived from the PASCAL CHiME dataset. The results show that the proposed approach provides a relative performance improvement of 0.26% compared to the standard autoencoder. The significance of the improvement was evaluated with a one-tailed z-test and found significant with p < 0.001. The presented approach thus shows promising results on this task and could be extended into a general training strategy for autoencoders if confirmed by additional experiments.
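    The training procedure described above can be sketched as follows: the discriminator learns to tell reconstructions from real segments, and the autoencoder minimizes MSE plus the binary cross-entropy measured at the discriminator output. The network sizes and the adversarial loss weight are assumptions for illustration only.

```python
# A minimal sketch (PyTorch) of adversarially training an autoencoder:
# MSE reconstruction loss + BCE from a discriminator, as in the abstract.
import torch
import torch.nn as nn

n_feats = 40
ae = nn.Sequential(nn.Linear(n_feats, 16), nn.ReLU(), nn.Linear(16, n_feats))
disc = nn.Sequential(nn.Linear(n_feats, 32), nn.ReLU(), nn.Linear(32, 1))
opt_ae = torch.optim.Adam(ae.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x, adv_weight=0.1):   # x: batch of 'normal' Log-Mel frames
    # 1) Discriminator: real segments -> 1, reconstructions -> 0.
    opt_d.zero_grad()
    d_loss = bce(disc(x), torch.ones(len(x), 1)) + \
             bce(disc(ae(x).detach()), torch.zeros(len(x), 1))
    d_loss.backward(); opt_d.step()
    # 2) Autoencoder: reconstruction MSE + fooling the discriminator.
    opt_ae.zero_grad()
    recon = ae(x)
    ae_loss = ((recon - x) ** 2).mean() + \
              adv_weight * bce(disc(recon), torch.ones(len(x), 1))
    ae_loss.backward(); opt_ae.step()
    return d_loss.item(), ae_loss.item()

print(train_step(torch.randn(64, n_feats)))
```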

    Rima Glottidis: Experimenting Generative Raw Audio Synthesis for a Sound Installation

    No full text
    Biologically-inspired algorithms such as artificial neural networks have been used extensively by computer music researchers for generative music and algorithmic composition. The recent introduction of raw audio Machine Learning (ML) techniques, however, represents a significant leap because they seem to be able to learn both high-level (event) and low-level (timbre) information at once. Employing such techniques for creative purposes is very challenging at this early stage, since there is a lack of method and experience, and their computational cost is very high. In this paper, we describe the technical and creative process behind Rima Glottidis, an installation based on material from the homonymous musical work by the artist Økapi, processed using raw audio ML techniques and premiered at the Blooming festival, Italy. Technical issues are described and critical aspects are reported as a reference for future projects.

    Few-Shot Siamese Neural Networks Employing Audio Features for Human-Fall Detection

    No full text
    Nowadays, human-fall detection is a problem recognized by the entire scientific community. Methods that achieve good performance use human-fall samples in the training set, while methods that do not can only work well under certain conditions. Since examples of human falls are very difficult to retrieve, there is a strong need to develop systems that can work well even with few or no data available for their training phase. In this article, we present a first study on few-shot learning with Siamese Neural Networks applied to human-fall detection using audio signals. The method has been compared with algorithms based on SVM and OCSVM, all evaluated under the same conditions. The proposed approach is able to learn the differences between signals belonging to different classes of events. In the classification phase, using only one human-fall signal as a template, it achieves about 80% F1-Measure on the human-fall class, while the SVM-based method reaches around 69% when trained under the same data-knowledge conditions.
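    The one-template decision rule described above can be sketched as follows: a shared embedding network maps audio features into a space where the distance to a single stored "fall" example drives the decision. The architecture, feature dimension and threshold are assumptions, not the paper's configuration.

```python
# A minimal sketch (PyTorch) of few-shot Siamese matching: one embedding
# network is applied to both the template and the query, and their distance
# is thresholded to decide whether the query is a human fall.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))

def siamese_distance(a, b):
    """Euclidean distance between the embeddings of two feature batches."""
    return torch.norm(embed(a) - embed(b), dim=-1)

fall_template = torch.randn(1, 40)     # the single human-fall example
query = torch.randn(5, 40)             # incoming audio segments
is_fall = siamese_distance(query, fall_template) < 1.0  # assumed threshold
print(is_fall)
```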

    Localizing speakers in multiple rooms by using Deep Neural Networks

    No full text
    In the field of human speech capturing systems, a fundamental role is played by source localization algorithms. In this paper, a Speaker Localization algorithm (SLOC) based on Deep Neural Networks (DNN) is evaluated and compared with state-of-the-art approaches. The speaker position in the room under analysis is directly determined by the DNN, making the proposed algorithm fully data-driven. Two different neural network architectures are investigated: the Multi-Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). GCC-PHAT (Generalized Cross Correlation-PHAse Transform) Patterns, computed from the audio signals captured by the microphones, are used as input features for the DNN. In particular, a multi-room case study is dealt with, where the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested by means of the home-recorded DIRHA dataset, characterized by multiple wall and ceiling microphone signals for each room. In detail, the focus is on the speaker localization task in two distinct neighboring rooms. As terms of comparison, two algorithms proposed in the literature for the addressed applicative context are evaluated: the Crosspower Spectrum Phase Speaker Localization (CSP-SLOC) and the Steered Response Power using the Phase Transform speaker localization (SRP-SLOC). Besides providing an extensive analysis of the proposed method, the article shows how the DNN-based algorithm significantly outperforms the state-of-the-art approaches evaluated on the DIRHA dataset, providing an average localization error, expressed in terms of Root Mean Square Error (RMSE), equal to 324 mm and 367 mm for the Simulated and Real subsets, respectively.
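    A sketch of the GCC-PHAT pattern used above as the DNN input feature: the cross-power spectrum of a microphone pair is phase-normalized (PHAT weighting) and transformed back to the lag domain. The window handling and the retained lag range are assumptions.

```python
# A minimal sketch of a GCC-PHAT pattern for one microphone pair: the
# peak location encodes the inter-microphone time difference of arrival.
import numpy as np

def gcc_phat(x, y, n_lags=25, eps=1e-12):
    """Return 2*n_lags+1 GCC-PHAT coefficients centred on zero lag."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cps = X * np.conj(Y)                              # cross-power spectrum
    r = np.fft.irfft(cps / (np.abs(cps) + eps), n)    # PHAT weighting
    return np.concatenate((r[-n_lags:], r[:n_lags + 1]))

mic_a, mic_b = np.random.randn(4096), np.random.randn(4096)
pattern = gcc_phat(mic_a, mic_b)
print(pattern.shape, pattern.argmax() - 25)  # (51,), delay estimate in samples
```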

    Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation

    No full text
    This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Interesting advancements are observed with respect to a previous work. A comparative and extensive analysis is conducted among four different neural networks (NN). In particular, we exploit the Deep Belief Network (DBN), the Multi-Layer Perceptron (MLP), the Bidirectional Long Short-Term Memory recurrent neural network (BLSTM) and the Convolutional Neural Network (CNN). The latter has recently enjoyed great success in the computational audio processing field and has been successfully employed in our task. Two home-recorded datasets are used in order to approximate real-life scenarios. They contain audio files from several microphones arranged in various rooms, from which six features are extracted and used as input for the deep neural classifiers. The output stage has been redesigned compared to the authors' previous contribution in order to take advantage of the networks' discriminative ability. Our study comprises a multi-stage analysis focusing on the selection of the features, the network size and the input microphones. Results are evaluated in terms of Speech Activity Detection (SAD) error rate. As a result, best SAD error rates of 5.8% and 2.6% are reached in the two considered datasets, respectively. In addition, significant robustness to microphone positioning is observed in the case of the CNN.
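    For illustration, a typical VAD output stage of the kind discussed above can look as follows: per-frame speech posteriors from a neural classifier are thresholded and smoothed before scoring. The threshold, filter length and the simple frame-wise error shown here are illustrative assumptions, not the exact SAD metric of the paper.

```python
# A minimal sketch of a frame-level VAD decision stage with median smoothing.
import numpy as np
from scipy.signal import medfilt

def vad_decisions(posteriors, threshold=0.5, smooth=11):
    """posteriors: (n_frames,) speech probabilities -> boolean decisions."""
    return medfilt((posteriors >= threshold).astype(float), smooth) > 0.5

def frame_error_rate(decisions, labels):
    """Fraction of frames where the decision disagrees with the label."""
    return np.mean(decisions != labels)

post = np.random.rand(1000)            # stand-in for network outputs
labels = np.random.rand(1000) > 0.5    # stand-in for ground-truth activity
print(frame_error_rate(vad_decisions(post), labels))
```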

    A neural network based algorithm for speaker localization in a multi-room environment

    No full text
    A Speaker Localization algorithm based on Neural Networks for multi-room domestic scenarios is proposed in this paper. The approach is fully data-driven and employs a Neural Network fed with GCC-PHAT (Generalized Cross Correlation Phase Transform) Patterns, calculated from the microphone signals, to determine the speaker position in the room under analysis. In particular, we deal with a multi-room case study, in which the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested against the home-recorded DIRHA dataset, characterized by multiple wall and ceiling microphone signals for each room. In particular, we focus on the speaker localization problem in two distinct neighbouring rooms. We assumed the presence of an Oracle multi-room Voice Activity Detector (VAD) in our experiments. A three-stage optimization procedure has been adopted to find the best network configuration and GCC-PHAT Patterns combination. Moreover, an algorithm based on Time Difference of Arrival (TDOA), recently proposed in the literature for the addressed applicative context, has been considered as a term of comparison. As a result, the proposed algorithm outperforms the reference one, providing an average localization error, expressed in terms of RMSE, equal to 525 mm against 1465 mm. Finally, we also assessed the algorithm's performance when a real VAD, recently proposed by some of the authors, is used. Even though a degradation of localization capability is registered (an average RMSE equal to 770 mm), a remarkable improvement with respect to the state-of-the-art performance is still obtained.
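    For reference, the RMSE figures quoted above correspond to a metric of the following form: the root of the mean squared Euclidean distance between estimated and reference speaker positions. The coordinates below are synthetic stand-ins.

```python
# A minimal sketch of the localization RMSE metric, in millimetres.
import numpy as np

def localization_rmse(est, ref):
    """est, ref: (n, 2) speaker positions in mm; returns scalar RMSE."""
    return np.sqrt(np.mean(np.sum((est - ref) ** 2, axis=1)))

ref = np.random.rand(100, 2) * 5000         # positions in a 5 m x 5 m room
est = ref + np.random.randn(100, 2) * 300   # estimates with ~300 mm error
print(localization_rmse(est, ref))
```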